-
Stage 1: linear projection of each flattened non-overlapping patch to its token/feature; these tokens, with position embeddings added, are fed to the first two successive Swin Transformer blocks
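A minimal numpy sketch of this patch-embedding step, assuming the Swin-T defaults (\( 4\times 4 \) patches, \( C=96 \)); the weight matrix here is random, standing in for the learned projection:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes following the Swin-T defaults: 4x4 patches, C = 96.
H, W, C_in, P, C = 224, 224, 3, 4, 96

image = rng.standard_normal((H, W, C_in))

# Partition the image into non-overlapping PxP patches and flatten each one.
patches = image.reshape(H // P, P, W // P, P, C_in)                   # (56, 4, 56, 4, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C_in)  # (3136, 48)

# Linear projection of each flattened patch to a C-dimensional token.
W_proj = rng.standard_normal((P * P * C_in, C)) * 0.02  # stand-in for learned weights
tokens = patches @ W_proj                               # (3136, 96)
```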
-
Stage 2&3&4: the patch merging layer first concatenates the features of each group of \( 2 \times 2 \) neighboring patches, then applies a linear layer to reduce the concatenated \( 4C \)-dimensional features to \( 2C \) (halving the resolution while doubling the channels), and finally feeds the processed features to two consecutive Swin Transformer blocks
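The patch merging step above can be sketched in numpy as follows (hypothetical stage-2 input of \( 56 \times 56 \times 96 \); the reduction matrix is random, standing in for the learned linear layer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stage-2 input: a 56x56 feature map with C = 96 channels.
H, W, C = 56, 56, 96
x = rng.standard_normal((H, W, C))

# Concatenate the features of each 2x2 group of neighboring patches: C -> 4C.
x0 = x[0::2, 0::2, :]   # top-left patch of each 2x2 group
x1 = x[1::2, 0::2, :]   # bottom-left
x2 = x[0::2, 1::2, :]   # top-right
x3 = x[1::2, 1::2, :]   # bottom-right
merged = np.concatenate([x0, x1, x2, x3], axis=-1)   # (28, 28, 384)

# Linear layer reduces the concatenated 4C channels to 2C.
W_reduce = rng.standard_normal((4 * C, 2 * C)) * 0.02  # stand-in for learned weights
out = merged @ W_reduce                                # (28, 28, 192)
```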
-
hierarchical feature maps by merging image patches in deeper layers
-
linear computational complexity in input image size, since self-attention is computed only within each local window
-
window multi-head self-attention (W-MSA)
-
W-MSA gives linear computational complexity in image size
\begin{align}
&\Omega(MSA)=4hwC^2 + 2(hw)^2C \\
&\Omega(W\mbox{-}MSA)=4hwC^2 + 2M^2hwC
\end{align}
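Plugging concrete numbers into the two formulas above (a hypothetical \( 56 \times 56 \) stage-1 feature map with \( C=96 \) and window size \( M=7 \)) shows the gap: the quadratic term dominates MSA, while W-MSA stays roughly an order of magnitude cheaper:

```python
# Evaluate the two complexity formulas above for a hypothetical
# stage-1 feature map (h = w = 56, C = 96) with window size M = 7.
h, w, C, M = 56, 56, 96, 7

msa = 4 * h * w * C**2 + 2 * (h * w) ** 2 * C    # quadratic in h*w
w_msa = 4 * h * w * C**2 + 2 * M**2 * h * w * C  # linear in h*w

print(msa, w_msa, msa / w_msa)  # W-MSA is ~14x cheaper at this resolution
```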
-
shifted window multi-head self-attention (SW-MSA)
-
SW-MSA augments W-MSA by introducing connections between neighboring non-overlapping windows of the previous layer
\begin{align}
&\hat{z}^l=W\mbox{-}MSA(LN(z^{l-1}))+z^{l-1} \\
&z^l=MLP(LN(\hat{z}^l))+\hat{z}^l \\
&\hat{z}^{l+1}=SW\mbox{-}MSA(LN(z^l))+z^l \\
&z^{l+1}=MLP(LN(\hat{z}^{l+1}))+\hat{z}^{l+1}
\end{align}
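The four equations above map directly onto a pre-norm residual structure. A minimal sketch, with the sub-layers passed in as caller-supplied stand-ins (the real W-MSA/SW-MSA/MLP each have their own learned weights):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LN over the channel dimension; learned scale/shift omitted for brevity.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def swin_block_pair(z, w_msa, sw_msa, mlp):
    """Two consecutive blocks mirroring the four equations above.
    w_msa, sw_msa, mlp are hypothetical stand-ins for the real sub-layers."""
    z_hat = w_msa(layer_norm(z)) + z     # \hat{z}^l
    z = mlp(layer_norm(z_hat)) + z_hat   # z^l
    z_hat = sw_msa(layer_norm(z)) + z    # \hat{z}^{l+1}
    z = mlp(layer_norm(z_hat)) + z_hat   # z^{l+1}
    return z

# Usage with identity stand-ins, just to check that token shape is preserved:
tokens = np.zeros((3136, 96))
out = swin_block_pair(tokens, w_msa=lambda x: x, sw_msa=lambda x: x, mlp=lambda x: x)
```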
-
cyclic-shifting toward the top-left direction
-
a batched window may be composed of several sub-windows that are not adjacent in the feature map, so a masking mechanism is employed to limit self-attention computation to within each sub-window
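A small numpy sketch of the cyclic shift and the region ids used for masking, mirroring the reference implementation's mask construction (toy \( 8 \times 8 \) map, window size \( M=4 \), shift \( s=M/2=2 \)):

```python
import numpy as np

# Toy 8x8 feature map with window size M = 4 and shift s = M // 2 = 2.
H = W = 8
M, s = 4, 2
x = np.arange(H * W).reshape(H, W)

# Cyclic shift toward the top-left: roll rows and columns up/left by s.
shifted = np.roll(x, shift=(-s, -s), axis=(0, 1))

# Assign each position a region id; positions with different ids share a
# batched window only because of the roll, so attention between them is masked.
region = np.zeros((H, W), dtype=int)
slices = (slice(0, -M), slice(-M, -s), slice(-s, None))
cnt = 0
for hs in slices:
    for ws in slices:
        region[hs, ws] = cnt
        cnt += 1
# Within each batched window, token pairs with unequal region ids get a large
# negative value added to their attention logits (the masking mechanism).
```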
-
relative position bias
\[
Attention(Q,K,V)=SoftMax(QK^T/\sqrt{d}+B)V
\]
where \( Q, K, V\in \mathbb{R}^{M^2\times d} \) are the query, key and value matrices; \( d \) is the query/key dimension, and \( M^2 \) is the number of patches in a window; since relative positions along each axis lie in \( [-M+1, M-1] \), the bias \( B \) is looked up from a smaller parameterized matrix \( \hat{B}\in \mathbb{R}^{(2M-1)\times (2M-1)} \)
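A single-head numpy sketch of this attention with relative position bias; \( Q, K, V \) and the \( (2M-1)\times(2M-1) \) bias table are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical single-head window attention: M = 7, so M^2 = 49 tokens, d = 32.
M, d = 7, 32
Q = rng.standard_normal((M * M, d))
K = rng.standard_normal((M * M, d))
V = rng.standard_normal((M * M, d))

# B is looked up from a (2M-1)x(2M-1) table using the relative coordinates of
# each token pair; a random table stands in for the learned one here.
table = rng.standard_normal((2 * M - 1, 2 * M - 1))
coords = np.stack(np.meshgrid(np.arange(M), np.arange(M), indexing="ij"),
                  axis=-1).reshape(-1, 2)                 # (49, 2) patch coords
rel = coords[:, None, :] - coords[None, :, :] + (M - 1)   # shift into [0, 2M-2]
B = table[rel[..., 0], rel[..., 1]]                       # (49, 49) bias matrix

attn = softmax(Q @ K.T / np.sqrt(d) + B)  # attention weights, rows sum to 1
out = attn @ V                            # (49, 32)
```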